I searched Pixabay for the term "fighter jets". I have always been fascinated by aircraft, especially military aircraft, so I took the opportunity to look at some cool pictures of planes.
I noticed some features that seemed to generally apply to the photos:
photo_data %>%
  select(`Picture Links` = pageURL) %>%
  knitr::kable()
# Range of view counts
view_range <- photo_data %>%
  pull(views) %>%
  range()
Before analysing the data we observed that the photos had a wide variety of view counts. Inspecting the data confirms this: of the selected photos, the one with the fewest views had 340 and the one with the most had 103,783.
# Number of users
num_users <- photo_data %>%
  pull(user) %>%
  unique() %>%
  length()
Each photo on Pixabay has a user associated with it, and our selection of photos is associated with 30 different users.
# Middle 75% of download view percentage
mid_75_down_view_percentage <- photo_data %>%
  pull(download_view_proportion) %>%
  quantile(c(0.125, 0.875)) * 100
From my investigation I found that, for the middle 75% of the photos, the percentage of viewers who went on to download the photo was between 45.3% and 75.8%.
During the analysis I assigned a category (Very Popular, Popular, Average, Unpopular) to each photo according to its number of views on a logarithmic scale. I was curious whether photos with more views were more likely to be liked, so I calculated a "like score" for each photo that adjusted for the number of views that photo received. Finally, I calculated the mean "like score" for each popularity category.
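As a minimal sketch of the log-scale binning described above, the categories can be assigned with `dplyr::case_when`; the decade thresholds (1e3, 1e4, 1e5 views) match those used in the data-preparation code, but the toy tibble here is a hypothetical stand-in for the real photo data:

```r
library(dplyr)

# Hypothetical stand-in for photo_data; only the views column matters here
toy_photos <- tibble(views = c(500, 5e3, 5e4, 5e5))

# One category per decade of views, i.e. a logarithmic scale;
# case_when checks conditions in order, so the first match wins
toy_photos <- toy_photos %>%
  mutate(popularity = case_when(
    views >= 1e5 ~ "Very Popular",
    views >= 1e4 ~ "Popular",
    views >= 1e3 ~ "Average",
    TRUE         ~ "Unpopular"
  ))
```

Because the conditions are evaluated top to bottom, each later branch only sees the views that fell through the earlier ones, which keeps the thresholds simple.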
# Mean likes score (adjusted for views) by popularity
likes_by_pop <- photo_data %>%
  mutate(like_adj = likes / views * 1e4) %>%
  group_by(popularity) %>%
  summarise(mean_likes = mean(like_adj))
# Output summary table
likes_by_pop %>%
  rename(Popularity = popularity,
         `Mean Like Score` = mean_likes) %>%
  knitr::kable()
| Popularity | Mean Like Score |
|---|---|
| Average | 45.41466 |
| Popular | 23.49212 |
| Unpopular | 121.58874 |
| Very Popular | 18.50014 |
These results were surprising: the Unpopular category had the highest mean like score while the Very Popular category had the lowest. A plausible explanation is that likes grow more slowly than views, so dividing likes by views inflates the score for photos with few views.
To demonstrate creativity, I had a look at which tags were used to describe the selected photos. Here is a table showing some information about the 25 most frequently used tags.
# Separate tags into different rows
tags <- photo_data %>%
  separate_rows(tags, sep = ", ")
# Interesting table
tags_table <- tags %>%
  group_by(tags) %>%
  summarise(tag_count = n(),
            mean_views = mean(views),
            mean_downloads = mean(downloads),
            mean_likes = mean(likes),
            largest = paste0(round(max(imageSize) / 1e6, 2), "MB")) %>%
  arrange(desc(tag_count))
tags_table %>%
  rename(Tags = tags, Count = tag_count, `Mean Views` = mean_views,
         `Mean Downloads` = mean_downloads, `Mean Likes` = mean_likes,
         `Largest Photo Size` = largest) %>%
  slice(1:25) %>%
  knitr::kable()
| Tags | Count | Mean Views | Mean Downloads | Mean Likes | Largest Photo Size |
|---|---|---|---|---|---|
| fighter jet | 14 | 13322.6429 | 8066.4286 | 32.071429 | 3.27MB |
| jet | 13 | 12515.1538 | 8453.2308 | 33.615385 | 2.91MB |
| plane | 11 | 1553.2727 | 841.9091 | 8.818182 | 3.28MB |
| aircraft | 10 | 5592.2000 | 3559.2000 | 14.900000 | 3.23MB |
| military | 7 | 1821.8571 | 1088.1429 | 8.428571 | 2.91MB |
| air force | 4 | 709.5000 | 451.2500 | 5.000000 | 2.64MB |
| aviation | 4 | 5107.2500 | 3231.7500 | 16.500000 | 2.75MB |
| saab jas 39 gripen | 4 | 534.7500 | 348.7500 | 3.500000 | 3MB |
| airplane | 3 | 4669.6667 | 2072.0000 | 13.000000 | 1.67MB |
| fighter | 3 | 8936.3333 | 4672.3333 | 19.666667 | 3.23MB |
| fighter plane | 3 | 21316.3333 | 9852.6667 | 48.000000 | 3.12MB |
| flight | 3 | 869.6667 | 609.6667 | 6.000000 | 2.63MB |
| armed forces | 2 | 3215.0000 | 2245.5000 | 19.000000 | 3.27MB |
| engine | 2 | 5041.5000 | 2004.0000 | 13.500000 | 2.86MB |
| f a-18 hornet | 2 | 8069.0000 | 3279.0000 | 19.500000 | 1.64MB |
| f-18 | 2 | 12979.0000 | 7154.5000 | 25.500000 | 2.12MB |
| fighter aircraft | 2 | 572.0000 | 374.5000 | 3.000000 | 3MB |
| jet fighter | 2 | 659.5000 | 397.0000 | 8.000000 | 1.83MB |
| panavia | 2 | 3215.0000 | 2245.5000 | 19.000000 | 3.27MB |
| ships | 2 | 21009.5000 | 14728.5000 | 77.500000 | 2.82MB |
| stealth | 2 | 6536.0000 | 4474.0000 | 34.000000 | 1.63MB |
| 3d rendered | 1 | 2654.0000 | 1888.0000 | 13.000000 | 2.57MB |
| aeroplane | 1 | 8015.0000 | 4179.0000 | 14.000000 | 2.13MB |
| aerospace | 1 | 1831.0000 | 1121.0000 | 10.000000 | 2.41MB |
| air show | 1 | 38689.0000 | 18654.0000 | 71.000000 | 1.67MB |
To give a better overview of the variety of tags used, I revisited the ggwordcloud package that I used in Project 2.
# Create word cloud
tags_table %>%
  ggplot() +
  geom_text_wordcloud(aes(label = tags, size = tag_count, colour = tag_count)) +
  scale_size_area(max_size = 13) +
  theme_minimal()
I believe I demonstrated creativity by using the separate_rows() function to split each photo's comma-separated tags into separate rows. I then used this information to create a table with interesting details about each of the 25 most frequently used tags. I also investigated the ggwordcloud package to create an interesting visual representation of how tags are used to describe the chosen photos.
I would like to disclose that I reused the CSS code from my previous projects, although I did add some formatting to the images in this project that was not present in previous projects.
I really enjoyed this module and learned a lot of important ideas. Firstly, I really liked learning about APIs and using them to obtain data; APIs make it easy to obtain relevant and current data for analysis. We also learned about JSON data and how to work with it. Finally, I developed more skill in manipulating data frames to extract the information I want from them.
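As a sketch of the API workflow mentioned here, the Pixabay data could be requested directly from the REST endpoint with jsonlite rather than read from a saved file. The parameter names (`key`, `q`, `image_type`, `per_page`) follow Pixabay's documented query interface, but treat the details as assumptions; the API key comes from an environment variable, and the request only runs when one is set:

```r
library(jsonlite)

# Build the query URL; PIXABAY_API_KEY is a placeholder you must supply
url <- paste0("https://pixabay.com/api/",
              "?key=", Sys.getenv("PIXABAY_API_KEY"),
              "&q=", URLencode("fighter jets", reserved = TRUE),
              "&image_type=photo",
              "&per_page=100")

# Only hit the network when a real key is available
if (nzchar(Sys.getenv("PIXABAY_API_KEY"))) {
  json_data <- fromJSON(url)
  # The photo records live under $hits, as in the saved pixabay_data.json
  photo_data <- json_data$hits
}
```

Saving the response to a JSON file, as done in this project, keeps the analysis reproducible even if the live results change.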
I am really interested in learning more about different data sources and expanding my knowledge of APIs. This would allow me to easily gather data to answer questions that I might have.
# Load packages and data
library(tidyverse)
library(jsonlite)
library(magick)
library(ggwordcloud)
json_data <- fromJSON("pixabay_data.json")
pixabay_photo_data <- json_data$hits
# Filter and select some of the photos
# Get list of column names
names(pixabay_photo_data)
# Keep only some variables, create new variables, and filter images to reduce number
# download_view_proportion is the proportion of people who downloaded an image after viewing it
# popularity ranks images logarithmically into 4 categories according to views
# A popular user is onkelglocke so is_onkelglocke checks if an image's user is onkelglocke
# We filter first by imageWidth and imageHeight to keep only photos of better than 4k resolution
# Since there were still too many photos, we kept only the images with imageSize below the median (roughly half of them)
selected_photos <- pixabay_photo_data %>%
  select(previewURL, pageURL, tags, imageWidth, imageHeight, imageSize,
         views, downloads, collections, likes, comments, user) %>%
  mutate(download_view_proportion = downloads / views,
         popularity = case_when(views >= 1e5 ~ "Very Popular",
                                views >= 1e4 ~ "Popular",
                                views >= 1e3 ~ "Average",
                                TRUE ~ "Unpopular"),
         is_onkelglocke = (user == "onkelglocke")) %>%
  filter(imageWidth >= 3840,
         imageHeight >= 2160) %>%
  filter(imageSize < median(imageSize))
# Save new dataframe as csv file
write_csv(selected_photos, "selected_photos.csv")
# List most popular users for selected photos
selected_photos %>%
  group_by(user) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
# Data exploration and summary values
glimpse(selected_photos)
# Some exploratory plots
selected_photos %>%
  ggplot() +
  geom_point(aes(x = download_view_proportion, y = likes),
             colour = "#50c4f2")
selected_photos %>%
  ggplot() +
  geom_point(aes(x = is_onkelglocke, y = download_view_proportion),
             colour = "#50c4f2")
selected_photos %>%
  ggplot() +
  geom_point(aes(x = popularity, y = download_view_proportion),
             colour = "#50c4f2")
selected_photos %>%
  ggplot() +
  geom_point(aes(x = collections, y = log(views)),
             colour = "#50c4f2")
selected_photos %>%
  ggplot() +
  geom_point(aes(x = views, y = likes),
             colour = "#50c4f2")
# Mean likes score (adjusted for views) by popularity
# Do higher popularity photos (logarithmically according to views) get more likes once we account for the effect of views?
selected_photos %>%
  mutate(like_adj = likes / views * 1e4) %>%
  group_by(popularity) %>%
  summarise(mean_likes = mean(like_adj))
# No, there does not seem to be such an effect
# Range of views
view_range <- selected_photos %>%
  pull(views) %>%
  range()
# Number of different users
num_users <- selected_photos %>%
  pull(user) %>%
  unique() %>%
  length()
# Number of photos by onkelglocke
selected_photos %>%
  pull(is_onkelglocke) %>%
  sum()
# 75% of the selected photos have a percentage of downloads per views between these values
# Note: braces are needed so the multiplication happens before rounding;
# the %>% operator binds more tightly than *
mid_75_down_view_percentage <- selected_photos %>%
  pull(download_view_proportion) %>%
  quantile(c(0.125, 0.875)) %>%
  {round(. * 100)}
# Create animated GIF
photos <- selected_photos %>%
  pull(previewURL) %>%
  image_read()
animated_photos <- image_animate(photos, fps = 1)
animated_photos %>%
  image_write("my_photos.gif")
# Tags Analysis
# Separate tags into different rows
tags <- selected_photos %>%
  separate_rows(tags, sep = ", ")
# Arrange tags according to how frequently they are used
popular_tags <- tags %>%
  group_by(tags) %>%
  summarise(cnt = n()) %>%
  arrange(desc(cnt)) %>%
  pull(tags)
# Five most popular tags
popular_tags[1:5]
# How many unique tags
length(popular_tags)
# Bar chart of the five most popular tags
tags %>%
  filter(tags %in% popular_tags[1:5]) %>%
  ggplot() +
  geom_bar(aes(x = tags), fill = "#50c4f2")
# Interesting table
tags_table <- tags %>%
  group_by(tags) %>%
  summarise(tag_count = n(),
            mean_views = mean(views),
            mean_downloads = mean(downloads),
            mean_likes = mean(likes),
            largest = paste0(round(max(imageSize) / 1e6, 2), "MB")) %>%
  arrange(desc(tag_count))
# Try word cloud
tags_table %>%
  ggplot() +
  geom_text_wordcloud(aes(label = tags, size = tag_count, colour = tag_count)) +
  scale_size_area(max_size = 6) +
  theme_minimal()